A Novel Method of Text Clustering for Chinese Spam Based on Semantic Body
Authors
Abstract
Statistics-based spam filtering performs poorly on new types of spam that use synonym substitution and camouflage. This paper therefore proposes a new text clustering method based on the Semantic Body for filtering Chinese spam. Word sense disambiguation, lexical chains built on HowNet, and statistical TF-IDF weighting are used to extract features from each mail, and the Semantic Body is obtained from this process. Text clustering based on semantic distance is then applied to the Semantic Bodies. Experimental results under CCERT Chinese-rules.cf show that the proposed approach performs well in filtering new types of Chinese text spam.
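As a rough illustration of the last two steps named in the abstract (TF-IDF feature weighting followed by distance-based clustering), a minimal Python sketch might look like the following. It is not the authors' method: the HowNet-based word sense disambiguation and lexical chains are omitted, cosine distance over TF-IDF vectors stands in for the paper's semantic distance, the single-pass clustering scheme and the `threshold` value are assumptions, and the pre-tokenized English sample mails are hypothetical placeholders for segmented Chinese text.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(a, b):
    """Stand-in for a semantic distance (assumption): 1 - cosine similarity."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cluster(vectors, threshold=0.8):
    """Single-pass clustering: join the first cluster whose seed is close enough."""
    clusters = []  # lists of document indices; the first index is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_distance(vec, vectors[members[0]]) < threshold:
                members.append(i)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([i])
    return clusters


# Hypothetical, already-tokenized mails.
mails = [
    ["cheap", "loan", "offer", "loan"],
    ["cheap", "loan", "deal"],
    ["meeting", "agenda", "monday"],
    ["meeting", "notes", "monday"],
]
print(cluster(tfidf_vectors(mails)))
```

In this toy input the two loan mails share weighted terms and end up in one cluster, while the two meeting mails form another. A semantic distance of the kind the abstract describes would additionally bring together mails that use different but synonymous words, which plain TF-IDF overlap cannot do.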
Similar resources
Fuzzy Clustering based on Semantic Body and its Application in Chinese Spam Filtering
The text is the main body of an e-mail. Its content is reflected by the semantic body, which is formed from a large number of semantic elements, so studying the semantic body of spam is the most authoritative and effective way to analyze its text. First, this paper uses HowNet to analyze semantic elements and the semantic bodies in e-mail text, then proposes the method...
Applications of Text Clustering Based on Semantic Body for Chinese Spam Filtering
Statistics-based spam filtering does not perform well enough on new types of spam that use synonym substitution and camouflage, because it ignores the semantic relations between words in the text and judges only from the words themselves. A spam filtering method based on the semantic body is therefore proposed in this paper. The method adopts lexical c...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text have motivated researchers to use...
Wavelet Packet Transform-Based Algorithm for Mixing Matrix Estimation
REGULAR PAPERS: Wavelet Packet Transform-Based Algorithm for Mixing Matrix Estimation (Yujie Zhang, Huiming Peng, and Hongwei Li); Applications of Text Clustering Based on Semantic Body for Chinese Spam Filtering (Qiu-yu Zhang, Peng Wang, and Hui-juan Yang); Uncertainty Time Series' Multi-Scale Fractional-Order Association Model (Yuran Liu, Mingliang Hou, and Yanglie Fu); Evaluation of OpenID-Based Doubl...
A Novel Method of Spam Mail Detection using Text Based Clustering Approach
This research paper presents a novel method for efficient spam mail classification using clustering techniques. E-mail spam is one of the major problems of today’s internet, causing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. A new spam detection technique using the text clusterin...